I. Introduction

Since tests have a powerful directive influence on teaching and the
study of pupils, a major policy to follow is to establish a testing 
program that faithfully reflects the objectives sought by the school. In 
this way the influence of testing is to reinforce the objectives sought 
by the school. 

Ralph Tyler, "What Testing Does to Teachers and Students," 1959,
in Anastasi, 1966, p. 49.

Standardized testing in the schools is on the increase. As Pipho (1985,
p. 19) observed, "nearly every large education reform effort of the past few
years has either mandated a new form of testing or expanded uses of
existing tests." The increasing prominence of testing over the last five years
is linked directly to efforts to reform education, particularly at the state
level. For example, a 50-state survey of reform measures conducted by
Education Week found that 29 states required competency tests for
students, and 10 other states had such a requirement under consideration;
15 states required an exit test for graduation, and 4 additional states had
such a measure under consideration; 8 states employed a promotional "gates"
test, while 3 others were considering such a mandate; finally, 37 states had
some sort of state assessment program, and 6 additional states had such a
program under consideration. This growing use of tests in the policy sphere by
agencies external to the local education agencies (LEAs), we will argue, is
having increasing impact on what is taught and learned in schools.

As a means of documenting this increasing attention to testing, and
contrasting it with curriculum concerns, we charted the amount of space in
Education Index over the last 50 years devoted to citations concerning
testing and curriculum. As shown in Figure 1, the average annual number of
column inches devoted to citations concerning curriculum has increased only
modestly over the last half-century, from 50-100 inches per year in the
1930s and 1940s to only 100-150 in recent years. In contrast, attention
devoted to testing in Education Index has increased dramatically, from only
10-30 column inches in the 1930s and 1940s to well over 300 inches in the
1980s.

Figure 1: Education Index Listings Under Testing and Curriculum, 1930-1985

Measuring column inches in Education Index is, of course, a fairly
crude way of charting what is happening in the world of education, but these
data certainly suggest a trend that we suspected even before looking at the
Education Index: increasingly, standardized testing seems to have
become the coin of the educational realm. In recent years, it seems that the
aims of education and the business of our schools are addressed not so much
in terms of curriculum (the courses of study that are or should be followed)
as in terms of what gets tested. The data from the Education Index,
showing that the relative attention to curriculum and testing issues has
undergone a ten-fold change in the last 50 years, clearly suggest this.

Before reviewing arguments and evidence bearing on the impact of 
testing, we need to comment briefly on what is meant by the term 
"standardized tests." Essentially, by this term we refer to standardized
achievement tests of reading and math skills, including state- and
local-education-agency sponsored standardized tests, commercially developed
nationally normed tests, and tests that have sometimes been called basic
skills or minimum competency tests. Under the rubric of "standardized
achievement tests" one might even include the Scholastic Aptitude Test
(SAT), which is, we think, essentially a general achievement test of verbal
and math skills rather than a test of aptitude (though we will not in this
brief presentation attempt to delve into the controversy over the
aptitude/achievement distinction, either in general or with respect to the
SAT).

By way of introduction it should be noted that such tests may be used
for a variety of purposes, including general evaluation of schools and
programs, diagnosis of student strengths and weaknesses, student grade
promotion, high school graduation, guidance, college admissions and even
teacher evaluation. However, whatever the initial purposes of either test
developers or sponsors for achievement testing, once a test is administered
the results often get used for other, quite different purposes. Perhaps the
most prominent example of this is the way in which college admissions tests,
despite occasional protestations by test sponsors, have come to be used as
general indicators of the educational health of states and the nation as a
whole (in this regard see, for example, the United States Office of Education's
annual Wall Chart, or the recent study, Trends in Achievement, by the
Congressional Budget Office of the U.S. Congress, 1986. Both use SAT and
ACT test data as two sources of evidence regarding national trends in
achievement).

Thus, in this brief paper we will not attempt to distinguish between
the different, ostensible, commonly recognized purposes of standardized
tests. Instead we will attempt to set out some fairly broad ideas about the
effects of testing on education, and on curriculum in particular, offering some
ideas on the past and future of the National Assessment of Educational
Progress (NAEP). Specifically, the following four sections of this paper are
organized around these topics:

II. Conditions of Testing Affecting Its Impact;

III. Seven Principles Regarding the Impact of Testing;

IV. NAEP as an Instrument for Informing Educational Policy; and

V. The Future of NAEP

II. Conditions of Testing Affecting Its Impact

Rather than commenting on the myriad specific ostensible uses of
testing, in this section we instead offer some general observations on four
broad issues that condition the impact of testing: (1) what is tested; (2) how
scores are referenced; (3) the source of testing; and (4) the rewards or
sanctions associated with test results.

1. What is tested.

A wide range of variables has been the subject of measurement,
though the main emphasis has been on the measurement of cognitive rather
than affective characteristics. The cognitive variables which have attracted
most attention are "intelligence" and achievement in basic skill areas of the
curriculum (reading, arithmetic). As one proceeds up the educational ladder
into secondary schools, where instruction is organized around subject
matter/content areas rather than around specific skills, commercially
available test batteries become less specific and less related to what is
taught. High school test batteries closely resemble elementary school
batteries in that they are more oriented to the basic skills of numeracy and
literacy than to what is taught in specific subject fields like math, physics,
history, English literature, etc. As a result, such test scores are less relevant
to the work of the high school teacher. However, some of the recent reform
reports call for the development of exams for specific secondary school
curriculum areas. This has profound implications for curriculum and
instruction at that level, because in general it seems that, other
things being equal, the impact of testing is greater when tests are keyed to
specific courses.

2. How scores are referenced.

The main distinction here is between norm-referenced tests, on which
performance is assessed by reference to the performance of other students,
and criterion-referenced tests, on which performance is assessed by
reference to the mastery of specific content domains. It should be noted that
norm-referenced information can also be structured to provide
criterion-referenced interpretations and vice versa. While criterion-referenced
tests are increasingly hailed as superior to norm-referenced ones in terms of
information provided to teachers, norm-referenced information is valuable
for comparative purposes. Further, the specificity of criterion-referenced
information from commercially available tests, relative to what is actually
taught at the local level, can often be dubious.

The general point to be made here is that however tests are 
referenced, if they are poorly matched to what is taught in schools, and if
they are linked to important decisions, they can have great impact for good 
or ill. 

3. Internal vs. external testing programs.

An important factor regarding the impact of testing is its source. The
main distinction here is between internal testing and external testing. An
internal testing program is one which is carried out within a school at the
initiative and under the control of the school superintendent, principal or
teacher. In this category we include the traditional norm-referenced
standardized achievement testing programs that have been used by school

There are two reasons the distinction between internal and external
testing programs is important. First, external tests tend to have a greater
mismatch with what is taught at the local level than do internal tests, for the
simple reason that curriculum and teaching methods are determined, either
explicitly or implicitly, at the "grass-roots level": the LEA, the school and
ultimately behind the classroom door. Second, external testing programs
tend to have clearer consequences associated with them than do locally
initiated and controlled tests. And as we argue in the next section, the size of
the stakes associated with testing programs is a key determinant of the
educational impact.

4. High Stakes vs. Low Stakes Testing Programs

Tests whose results are seen, rightly or wrongly, by students,
teachers, administrators, parents, or the general public as being used to make
important decisions that immediately and directly affect them are what we
shall term high-stakes tests. High-stakes student tests can be norm- or
criterion-referenced, internal or external in origin. Examples include tests
directly linked to such important decisions as: (1) graduation, promotion or
placement of students; (2) the evaluation or rewarding of teachers or
administrators; (3) the allocation of resources to schools or school districts;
and (4) school or school system certification. In all of these examples, the
perception of people that test results are linked to a high-stakes decision is
in fact accurate. Policy makers have mandated that the results be used
automatically to make such decisions.

However, there are other uses of test results that do not actually
impact immediately and directly on students but nonetheless are generally
perceived by people as involving high stakes. For example, SAT and ACT
results are of secondary importance in admission decisions in the vast bulk
of colleges trying to fill vacant seats in the face of adverse demographics.
Individuals and school systems, nevertheless, very often act on the
perception that these college admissions tests are of crucial and singular
importance, if not to individual students, at least to the public's perception of
the quality of schooling. Thus, we find high schools are increasingly offering
courses to prepare students to take these tests, and commercial coaching
schools are doing a land office business.

In contrast to a high-stakes test, a low-stakes test is one which is
perceived as having no important rewards or sanctions tied directly to test
performance. Traditional school district standardized norm-referenced
testing programs, where results are reported to teachers but there is no
immediate, automatic decision linked to performance, are examples of this
sort of testing. Teachers are free to ignore any results that they feel are
discrepant from their own perceptions of students, and the results are not
perceived by them as being used to evaluate their performance. This does
not mean that test results from such programs do not affect teachers'
perceptions of students, nor does it mean that student placement decisions
are not related to test performance. The important distinction is that
teachers, students, and parents do not perceive test performance as a direct
vehicle of reward or sanction.

In short, it is clear that "high stakes" testing has the greatest impact
on schooling, regardless of whether the stakes are associated with specific
decisions made on the basis of the results, or with the perceived importance
of the tests.

Arguments Concerning the Effects of Standardized Testing

If these are conditions which can affect the degree of impact of
testing, what of the nature of the impact? Is it for good or ill? Over the last
decade a wide variety of arguments have been advanced concerning the
positive and negative effects of standardized testing. These arguments have
been advanced in a variety of forums, including professional journals and
meetings, the popular press, legislative debates, meetings of school
governance agencies, and in both federal and state courts.

Without trying to document such myriad sources, it is useful to
consider the nature of the arguments advanced regarding the positive and
negative effects of standardized achievement testing. Among the most
common affirmative arguments have been that such testing:

• helps focus instruction on skills (e.g., basic skills);

• motivates students; 

• provides teachers with diagnostic information to improve 
instruction; 

• identifies curriculum areas in need of improvement; and 

• helps hold teachers and schools accountable for the learning of 
their charges. 

Among the most common charges levelled against standardized achievement 
tests are that they: 

• Are biased against certain kinds of students;

• Do not match what students have been taught;

• Constrain teachers' initiative and creativity in teaching;

• Promote teaching to the tests;

• Narrow the curriculum.

from our review of this literature in the form of seven principles regarding
the impact of testing.

Principle 1:

The more any quantitative social indicator is used for social decision-making,
the more subject it will be to distort and corrupt the social processes
it is intended to monitor.

This Principle comes directly from Donald Campbell's work on social
indicators. It is not limited to testing per se, being much more general in
scope, extending to any social indicator that is used to describe, make
decisions about, or influence an important social process. This Principle
reminds us that educational testing is a form of measurement subject to a
social version of Heisenberg's uncertainty principle. Any measurement of the
status of an educational institution, no matter how well contrived, inevitably
changes its status. And when the testing is used for important social
decisions, the change tends to be both large and corrupting.

Principle 2:

The power of tests and examinations to affect individuals, institutions,
curriculum or instruction is a perceptual phenomenon: if students, teachers,
or administrators believe that the results of an examination are important, it
matters very little whether this is really true or false; the effect is
produced by what individuals perceive to be the case.

Ben Bloom, writing in the NSSE Yearbook, coined this second
Principle. Its importance lies in the fact that when people perceive a
phenomenon to be true, their actions are guided by the importance perceived
to be associated with it. The greater the stakes perceived to be linked to test
results, the greater the impact on instruction and learning. A high-stakes
test is one where the results are seen by students, teachers, administrators,
parents, or the general public to be used to make important decisions that
immediately and directly impact on individuals with whom they are
concerned. Examples of high-stakes student tests include those perceived to
be directly linked to such important decisions as: (1) graduation, promotion
or placement of students; (2) the evaluation or rewarding of teachers or
administrators; (3) the allocation of resources to schools or school districts;
and (4) school or school system certification.
Principle 3:

If important decisions are presumed to be related to test results,
then teachers will teach to the test.

This sort of accommodation by teachers to a high-stakes test is seen as
having both positive and negative consequences. High-stakes tests, it is
often argued, can focus instruction, giving students and teachers specific
goals to attain. If the test is measuring basic skills, preparing students for
the skills measured by the test could, proponents argue, serve as a powerful
lever to improve basic skills. Unfortunately, however, the only sort of
evidence generally available to bolster this proposition is that scores on
high-stakes tests do tend to increase over time. But standardized tests are
indirect measures of the real skills of interest, and what repeated experience
shows is that there are many, many ways of raising test scores without
changing the levels of the skills the tests are intended to measure. People
too often fail to distinguish between the skill and the indirect indicator of it.

If the test is specific to a specialized curriculum area, e.g. college 
preparatory physics, then the examination will eventually narrow 
instruction and learning, focusing only on those things measured by the 
tests. Indeed, this narrowing of the curriculum has been one of the enduring 
complaints leveled at external certification examinations used for the
important functions of certifying the successful completion of elementary or
secondary education, and admission to third level education and/or certain 
jobs. 

A review of the effects of such exams on the curriculum over many
years, and in several countries, indicates that faced with a choice between
objectives which are explicit in the curriculum or course outline, and a
different set which are implicit in the certifying examination, students and
teachers generally choose to focus on the latter. Spaulding's 1938 report
pointed out that teachers in New York disregarded the objectives in local
curriculum guides in favor of those tested in the Regents' examinations.
Morris found the rigidity of the exams was the principal reason that the
chemistry curriculum in Australia remained almost unchanged from 1891 to
1939. He concluded that the proportion of instructional time spent on
various aspects of the syllabus was "seldom higher than the predictive
likelihood of its occurrence on the examination paper." Similar observations
about the influence of the exams on the curriculum have been made in India,
Japan, Ireland, and in England. Turner sums up the English experience: "One
only has to look at the timetable of the typical comprehensive school to see
that the curriculum consists almost entirely of subjects which can be taken
in public examinations."

Why does this happen? First, there is tremendous social pressure on
teachers to see to it that their students acquit themselves well on the
certifying examinations. Second, the results of the examination are so
important to students, teachers, and parents that their own self interest
dictates that instructional time focus on test preparation.

As we have indicated, however, a high-stakes test can serve as a lever
for the introduction of new curricular material. New curricula in physics,
chemistry, and mathematics made an immediate impact in Irish schools
when in the early 1960s they were prescribed and examined for the
Leaving Certificate Examination. Primary teachers in Belgium accepted
curricular reforms only when, in 1936, the external exams given at the end
of primary school were modified to incorporate the ideas of the new
curriculum. In New York State, curriculum specialists from the State
Department of Education had little success in moving the emphasis in
modern language teaching from grammar and translation to conversation
and reading skills until the corresponding changes had been incorporated in
the content of the Regents' examination. Revisions of the College Entrance
Examination Board (CEEB) math achievement tests to include modern math
played an important part in the introduction of "modern" mathematics
curricula in the 1960s.

Despite the ability of high-stakes examinations to help introduce new
material, the weight of examination precedent strongly influences not just
what is taught and learned but also how. The question that educators must
ask themselves is whether the positive aspects associated with this phenomenon
outweigh the disadvantages. The answer is a value judgment and depends
on one's view of education, the learner, teaching, curriculum development
and testing. Our view of the "driving" of instruction via external
examinations is that in the long term, given the nature of most commonly
used tests, the narrowing of instruction and learning associated with this
phenomenon far outweighs any advantages.
Principle 4:

In every setting where a high-stakes test operates, a tradition of past
exams has developed and it eventually comes to define the curriculum de
facto.

Given Principle 3, the question still remains: "How do teachers cope
with the pressure of the examination?" The answer is relatively simple.
Teachers see the kind of intellectual activity required by previous test
questions and prepare the students to meet these demands. Some have
argued strongly that if the skills are well chosen, and if the tests truly
measure them, then coaching is perfectly acceptable. This argument sounds
reasonable, and in the short term, it may even work. But we need to take a
longer view, because the argument ignores a fundamental fact of life: When
the teacher's professional worth is estimated in terms of exam success,
teachers have great incentive to produce gains, and they can do so more simply
by teaching test-taking strategies based on previous exam questions.

Further, the expectations and deep-seated primary agenda of students
and their parents for exam success will further corrupt the process. The
view that we can coach for the skills apart from the tradition of test
questions is a staggeringly optimistic view of human nature which ignores
the powerful pull of self-interest. It simply doesn't consider the long-term
effects of the examination sanctions.

An interesting aspect of Principle 4 is that if the examination is
perceived as important enough, a commercial industry develops, outside the
schools, to prepare students for it. In this country this phenomenon can be
seen in the rise of commercial firms in virtually every major city selling
coaching services to students for the SAT college admissions test. Another
sign of the increasing prominence of test coaching in the United States is the
fact that the phrase "test taking skills" first appeared as a separate indexing
category in volume 53 of the Education Index, covering education literature
for the July 1982 - June 1983 period. Most of the articles referenced under
this new category dealt with improving admissions test scores through
coaching provided by commercial or computer tutorial programs. In Japan
it is common for parents to enroll their children in special extra-study
schools known as juku. Beginning in the 19th century a whole industry of
private coaching schools called "crammers" developed in Europe to prepare
students, for a fee, for high-stakes examinations. The important point about
these coaching schools is not whether or not they are successful in preparing
students for the exam; it is, instead, that the public perceives them as helpful
and is willing to pay for their services.
Principle 5:

Teachers and students pay particular attention to the form of the
questions on a high-stakes test (e.g., short answer, essay, multiple choice) and
adjust their instruction accordingly.

The problem here is that the form of the test question can narrow
instruction, study and learning to the detriment of other broader learning.
Rentz recounts an example of this phenomenon which occurred as a result of
the Georgia Regents' Testing Program, a program designed to assess 
minimum competencies in reading and writing on the part of college 
students in that state. The head of an English department lamented: 

Because we now are devoting our best efforts to getting the 
largest number of students past the essay exam we are
teaching to the exam, with an entire course, English III, given
over to developing one type of essay writing, the writing of a
five-paragraph argumentative essay written under a time
limit on a topic about which the author may or may not have
knowledge, ideas, or personal opinions. Teaching this one useful
writing skill has the beneficial effect of bringing large numbers
of weak students to a minimal level of literacy, but at the same
time, it devastates the content of the composition program that
should be offering the better students challenges to produce
writing of high quality. Because the Regents Test is primarily
designed to establish a minimal level of literacy, our teaching
to this test, which its importance forces us to do, tends to make
the minimal acceptable competency the goal of our institution, a 
circumstance that guarantees mediocrity. 

Principle 5 has profound implications for the curriculum specialists.
Given our free enterprise system, publishers have begun to look at state
mandated minimum competency or basic skills tests in order to design
materials to better train pupils to take them. Children are apt therefore to
find themselves spending more and more time filling out ditto answer sheets
or workbooks. Deborah Meier, a successful principal of a public school in
Manhattan, testified at the 1981 NIE-sponsored hearings on Minimum
Competency Testing (MCT) that in New York City, reading instruction has
come to closely resemble the practice of taking reading tests. In reading
class, students, using commercial materials, read dozens of little paragraphs
about which they then answer multiple choice questions. Meier described the
materials as evolving to resemble more and more the tests students will take
in the spring. She went on to point out that when synonyms and antonyms
were dropped from the New York City test of word meaning, teachers
promptly dropped commercial material that stressed them. It is also
interesting to note that in 1985, sales of ditto paper were way up nationally
while sales of lined theme paper were down.

Principle 6: 

When test results are the sole or even partial arbiter of future
educational or life choices, society tends to treat test results as the major
goal of schooling rather than as a useful but fallible indicator of achievement.

Of all of the effects attributed to tests, this may well be the most 
damaging to education. It is illustrated in the following observation from a
nineteenth century British school inspector who observed first hand the
negative effects of linking teacher salaries in England and Ireland to pupil
examination results:

Whenever the outward standard of reality (examination results)
has established itself at the expense of the inward, the ease
with which worth (or what passes for such) can be measured is
ever tending to become in itself the chief, if not sole, measure of
worth. And in proportion as we tend to value the results of
education for their measurableness, so we tend to undervalue
and at last to ignore those results which are too intrinsically
valuable to be measured.

Sixty years later, Ralph Tyler echoed the same message in the 62nd NSSE
Yearbook when he warned readers that society conspires to treat marks in
certifying examinations as the major end of secondary schooling, rather than
as a useful but not infallible indicator of student achievement.

We see the importance society places on test scores to the exclusion of
other indicators in such things as: the media attention to declines in SAT
scores; reports that our schools score lower than those of other countries in
math and science; the Education Department's wall chart that ranks states by
their performance on the SAT or ACT; newspapers ranking school districts
and/or schools within districts by their performance on standardized tests;
the use of test results by real estate agents in selling homes; the money
spent by parents on coaching schools for the SATs; the list could go on and
on.

Principle 7:

A high-stakes test transfers control over the curriculum to the agency
which sets or controls the exam.

The agency responsible for a high-stakes test assumes a great deal of 
power or control over what is taught, how it is taught, what is learned and 
how it is learned. This phenomenon is well understood in Europe where a
system of external certification examinations, controlled by the central
government or by independent examination boards, operates at the
secondary level. And while this shift in power is also understood in this
country by policy makers who are mandating graduation and promotion
tests, the implications of the shift from the local educational authority (LEA)
to the state department of education (SEA) have not received sufficient
attention and discussion. Further, since most state level testing programs
are developed and validated for SEAs by outside contractors, it is important
to realize that the state may be effectively delegating this very real power to
a commercial company whose interest is primarily financial and secondarily
educational.

Despite the negative consequences historically associated with
Principles 1-7, the use of tests perceived as involving high stakes is growing
in the United States. Policy makers are well aware of the high symbolic
value of tests. By mandating a test they are seen to be making things
happen. The test becomes a synecdoche for standards. The general public is
gratified to find the tests restoring confidence in the schools. The numerical
scores from high-stakes tests have an objective, scientific, almost magical
persuasiveness about them that the general public and policy makers are
quick to accept. Curriculum specialists, teachers, and administrators
therefore increasingly are forced to deal with the consequences of mandated
high-stakes tests. Until recently, these tests have generally involved basic
skills or minimum competencies. However, recently various education
reform reports have recommended a move to secondary school certification
tests in individual subjects that closely resemble the British 'O' and 'A'
examinations.

In order to minimize the negative side effects associated with the
impact of such tests, and to maximize possible benefits, it is important that
professionals understand the value and limitations that historically have
been associated with their use. They also need to understand how testing
and the use of tests is changing in the United States: we are moving from
LEA control to SEA control. Further, there are proposals that would alter the
gathering and reporting of NAEP data to permit inter- and intra-state
comparisons. The full philosophical, political and educational issues
associated with such shifts in the use of tests need to be better understood
and more fully debated.

Certainly, testing has an important function to play in American 
education. Test results, when used in conjunction with other data about pupil 
performance, can help teachers to improve their instruction and to make 
educationally sound decisions. Test results can also provide important 
independent information to parents, the public and school administrators 
about the schools. Much of this is information to which they are entitled. 
What we need right now is an awareness of the limitations and fallibility of 
tests. Test scores need to be regarded as one important element in decision 
making. The scores should be used by teachers and administrators in 
conjunction with other indicators of students' progress when making 
important decisions. 

IV. NAEP as an Instrument for Informing Educational Policy 

There are two basic ways in which tests may affect educational policies. 
The first is the use of test information to inform policy makers about the 
current state of education. The second is the use of tests as administrative 
devices in the implementation of policy. In the former case test results are 
used to describe the present state of education or some aspect of it, or in 
lobbying efforts for new programs or for reform proposals. The effects of 
this informational, descriptive use of test results on the educational process 
are indirect. This is in sharp contrast to the administrative use of test 
results, whereby results automatically trigger a direct reward or sanction 
applied to an individual or institution. 

The Use of Test Results to Inform Policy 

The 1867 Act establishing the Department of Education recognized the 
need for gathering descriptive information about "the condition and progress 
of education in the several states and territories." Of course, at that time 
testing as we now know it did not exist. From the 1920's to the 1960's, 
standardized tests had little or nothing to do with state or federal policy. It 
was not until the 1960's, with the passage of the Elementary and 
Secondary Education Act of 1965 and the establishment of the National 
Assessment of Educational Progress (NAEP), that the Department began 
systematically to gather test data as part of its original mandate. Further, 
state departments of education have only recently begun to systematically 
collect test data to describe the status of education. 

There were several reasons for this shift. First, the concept of equality 
of educational opportunity evolved from a concern about equality of inputs, 
resources, and access to programs into a preoccupation with achieving 
equality of outcomes. As a result test scores began to be used as a primary 
indicator of educational outcomes. Second, advocates for minority groups 
began to point to the large discrepancies between the test scores of middle 
class students and those of their constituents to lobby successfully for 
compensatory funds for programs to reduce these disparities. Third, the large 
expenditures in the 50's and 60's for curriculum development and 
compensatory programs led policy makers to ask for student test data as an 
indicator of the effectiveness of these programs. Fourth, as noted above, NAEP 
was designed to provide a basis for public discussion and broader 
understanding of educational progress and problems of the nation. 

More recently, the numerous educational reform reports have used 
test results, including SAT and NAEP data, to bring to the attention of the 
country what they conclude to be the mediocre state of American education, 
as well as to lobby for improvement programs to redress these weaknesses. 
Test data clearly form an important basis for the current negative 
descriptions of the status of American academic education. The question is, 
how valid are these inferences? Is the academic performance of our 
students as poor as it is painted in the various reports? While there are 
weak spots - particularly at the secondary level with higher order skills - 
one could look at the same data and conclude that our schools are doing 
quite a creditable job, and that declines and weak spots may be due in large 
part to non-school factors. In general the reports use test results in ways that 
accentuate the negative, eliminate the positive and leave no room for Mr. 
In-Between. 

An illuminating example of how these indicators are actually used to 
inform such reports is provided by the Twentieth Century Fund Report. It 
opens with the following gloomy assertion: "The nation's public schools are in 
trouble. By almost every measure ... the performance of our schools falls far 
short of expectations." However, in a commissioned background paper, 
published as an appendix to the Report itself, Peterson examines all of the 
available indicators, including test scores, and concludes that "Nothing in 
these data permits the conclusion that educational institutions have 
deteriorated badly." It would seem that the Task Force did not take 
cognizance of its own commissioned paper. 

Stedman and Smith, in their excellent review of these reform proposals, 
point out that they are quintessentially political documents. Testing 
evidence was used selectively to buttress arguments, and evidence that 
might have lessened the impact of the message was often ignored. Stedman 
and Smith examined critically the way test score indicators were interpreted 
and concluded not only that interpretation was often sloppy, but also that we 
have little in the way of valid, longitudinal national indicators of the 
academic performance of students nationwide. NAEP, however, was cited as 
the single exception to the latter indictment. 

V. The Future of NAEP 

In considering the future of NAEP, we agree with Stedman and Smith 
and other observers of the currently available indicators of the nation's 
educational health: NAEP is the most valid source of national data on our 
students' general achievement level over time. In light of our previous 
discussion, we note too that NAEP has clearly been designed to inform 
educational policy-making, not drive it. This is in sharp contrast to many 
new testing programs implemented over the last decade at the state level, 
which have been designed not just as means for informing policy, but 
instead as vehicles to implement educational policies. For instance, tests that 
are directly linked to decision-making about individual students (e.g. 
regarding grade to grade promotion, remediation or graduation), teachers 
(e.g. for teacher evaluation), schools or school systems (e.g. used to allocate 
funding or other resources, or to rank publicly schools, school systems or 
even states) all are, at least in some degree, instruments that not only inform 
policy, but also implement it. 

Despite the relatively high praise that NAEP has received as a source 
of valid data on the status of our nation's students, it also has been subjected 
to a variety of criticisms. For example, when NAEP moved from reporting 
results at only the item level to reporting results aggregated across items, it 
was criticized for a serious failure to collect and analyze appropriate construct 
validity evidence (Haney, 1982). It is our understanding, however, that 
this problem has been, or at least is being, remedied via new latent trait 
scaling procedures designed by the Educational Testing Service since it 
took over the administration of NAEP from the Education Commission of the 
States. 

A more common criticism of NAEP in almost every review of it over 
the last two decades has been that its results have not been terribly useful 
(e.g. Greenbaum, Garet and Solomon, 1977; Hazlett, 1974; GAO, 19?; Wirtz 
and Lapointe; Haney, 1982). In order to provide a rough check on the utility 
of NAEP over time we searched the ERIC data base for the 
number of occurrences of the expression "national assessment" by year. To 
provide some perspective we also ascertained the total number of citations 
in the ERIC data base by year and, as a point of contrast with the search for 
"national assessment" items, the number of occurrences, also by year, of 
items containing the phrase "educational assessment." Details of how we 
conducted this search and its limitations are described in Appendix 2, and 
the results are depicted in Figure 2. 

Our findings suggest two major points. One is that in the citations 
from around 1970 (specifically 1969, 1970, and 1971), the number of 
occurrences of the phrase "national assessment" exceeded the number of 
occurrences of the phrase "educational assessment." However, beginning in 
1973 the number of times the phrase "educational assessment" is found in the 
ERIC data base increases dramatically. This is evidence, we suspect, of one of 
the indirect effects of NAEP, namely that of promoting broader attention in 
the educational research community to the idea of educational assessment. 
More concretely, of course, the NAEP model also contributed in the 1970s to 
the founding of a number of statewide assessment programs. (Given the 
scale of the vertical axis of Figure 2, it is hard to discern the pattern of 
citations for the pre-1973 years, but readers interested in details may 
consult the data table in the appendix.) 

More recently, in the mid-1980s, the number of entries under both 
terms, "educational assessment" and "national assessment," has decreased 
sharply, by roughly one-third to one-half from what they were at their peaks 
(in 1982 the peak number for "national assessment" was 83, and for 
"educational assessment" the peak occurred in 1979 with 598). In part these 
changes may simply represent the trend of enthusiasms and jargon in the 
broad field of educational research and testing. For it seems apparent that 
the enthusiasm for "assessment" in the 1970s gave way to interest in 
"competency testing" and "basic skills testing" in the late 1970s and early 
1980s. 
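
As a rough check on the size of this decline, the peak counts can be compared with the 1985 counts given in the appendix data table. The following is a minimal sketch in Python; the function name and the dictionary layout are illustrative assumptions and are not part of the original search procedure.

```python
# Peak and 1985 ERIC citation counts, taken from the appendix data table.
# The dictionary layout is an illustrative assumption.
counts = {
    "national assessment": {"peak_year": 1982, "peak": 83, "latest": 44},
    "educational assessment": {"peak_year": 1979, "peak": 598, "latest": 275},
}


def fraction_of_peak(peak: int, latest: int) -> float:
    """Return the latest count as a fraction of the peak count."""
    return latest / peak


for phrase, c in counts.items():
    share = fraction_of_peak(c["peak"], c["latest"])
    print(f"{phrase}: 1985 count is {share:.2f} of its {c['peak_year']} peak")
```

On these figures, each series stood in 1985 at roughly one-half of its peak value.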

Nevertheless, the figures represented in Figure 2 give cause for 
concern about the continuing utility of NAEP. And it is against this 
background that we wish to consider ideas about the future of NAEP. 
Specifically, we wish to consider the idea of extending NAEP to be used in 
statewide testing programs, and the possibility of NAEP assessment exercises 
being used not just in sampling assessments but also in programs in which 
populations of students are tested (within a state or school 

notion of using NAEP exercises (or items) not just in national 
assessments based on stratified random samples, but also in statewide 
programs in which all students at a particular grade might be tested 
excellent economic sense. It would allow the considerable costs of 
development of NAEP exercises to be amortized over much larger 
. However, what we wish to warn against are three dangers 
with the possibility of using NAEP in statewide assessments 
where whole populations of students rather than samples might 

dangers we wish to warn against are: 1) If NAEP becomes an 
of educational policy it likely would have distorting effects on 
ms; 2) If NAEP becomes an instrument of policy, it would 
validity of NAEP findings; and 3) Such wider instrumental use 
might threaten some of the indirect benefits which we think have 
associated with NAEP in its first 20 years. 
NAEP as an instrument of educational policy would have 
effects. Here our worry is that in the efforts to make NAEP 
if it develops into a program which would allow specific 
be made on the basis of NAEP results (or even state by state 
), it would change from what it has been over the last 20 years, 
source of high quality information for informing educational policy, 
imsti^ policy. 

ft our concern is that to the extent that NAEP becomes an 
administrative device for implementing educational policy, the negative 
described in the seven principles above would come into play. We 
Id that we know of no specific plans by anyone to use 
specific decisions about students (such as grade promotion 
However, as we have argued, even comparisons such as state 
t on the basis of test results can lead to distortions. 
instrument of policy, might have less validity. 
I is that efforts to improve the utility of NAEP might 
the validity of assessment data derived from it. Here, we 
has been one feature of NAEP universally praised, it is the 
information derived from NAEP assessments (with the 
prior to the ETS administration of NAEP, of the failure to 
validity evidence supporting the reporting of data on groups 
- but as we said, we think that this problem has been 
new ETS scaling procedures). However, if efforts to make 
have the effect of making NAEP an administrative 
educational policy, this would likely not only have distorting 
educational practices, but also would threaten the validity of 
derived from NAEP. Another way of communicating this 
the distinction between obtrusive and unobtrusive 
the real virtues of unobtrusive measurement is that 
unobtrusive it tends to have higher measurement validity than 
measurement. Unobtrusive measurement need not be unuseful, 
our view that educational testing and assessment works 
untoward consequences when their effects are mediated 
and judgments of educational policymakers and 

As we have already noted, one of the relatively early indirect benefits 
of NAEP seems to have been that in the mid-1970s, it helped promote the 
development of broader interest in educational assessment and specifically 
several statewide assessment programs. However, in the late 1970s, as we 
also already noted, many state programs moved in the direction of 
"minimum competency" or "basic skills" testing. In this regard we feel that 
one of the real indirect benefits of NAEP is that it has provided a model of 
broader assessment, broader both with regard to the range of knowledge 
and skills tested, and with regard to the methods of assessment used. 

In this regard, and in thinking about the possible indirect effects of 
standardized testing on education in general and on curriculum specifically, 
two general points seem apparent. One is that though most standardized 
tests may focus on reading and math skills, the general aims of education as 
represented in curriculum are considerably broader. Take for example the 
"New Basics" advocated by the National Commission on Excellence in 
Education (1983). This group, though only one of many groups recently 
advocating reforms of education, was perhaps the most prominent. It 
recommended that 

state and local high school graduation requirements be 
strengthened and that at a minimum, all students seeking a 
high school diploma be required to lay the foundations in the 
Five New Basics by taking the following curriculum during the 
4 years of high school: (a) 4 years of English; (b) 3 years of 
math; (c) 3 years of science; (d) 3 years of social studies; and 
(e) one-half year of computer science. For the college-bound, 
2 years of foreign language are strongly recommended in 
addition to those taken earlier. (National Commission on 
Excellence in Education, 1983, p. 24.) 

course only one of many recent reform 
for improving education in the United States. But a 
t this recommendation, in contrast to the coverage of most 
i, is that three of the five "new basics" are badly, if not 
ed in most statewide testing programs. This has not been. 

general feature of almost all large scale testing programs, 
being perhaps the most notable exception, is that they 
multiple-choice format almost exclusively. This format is in 
the calls in many of the recent education reform reports 
be taught not just to solve pre-set problems, but to 
out solutions to problems they find for themselves. In 
reminded of Norman Frederiksen's article suggesting that 
of the multiple-choice format represents the real test 
reviews evidence showing that particularly for higher 
ttg skills, involving solving open-ended or ill-structured 
choice items may not be good measures of the broad 
abilities that we might hope are taught in schools. Thus, 
of tests employing the multiple-choice format, which 
»d to recent interest in "test-taking skills," may be 
tly the wrong message about what it is that we want our 
g. In this regard it is notable that Frederiksen closed his 
test bias with a message remarkably similar to the one 
een year earlier, quoted at the start of this article. As 

test bias" in my title has to do with the influence of 
teaching and learning. Efficient tests drive out less 
, leaving many important abilities untested and 
important task for those involved in testing is to 
develop instruments that will better reflect the whole domain 
of educational goals and to find ways to use them in 
improving the educational process. (p. 19) 

In the future, we therefore hope that NAEP will maintain its 
past tradition of employing diverse methods of assessment and 
helping to do what both Tyler and Frederiksen advocate - namely making 
sure that our assessments span as broad a range of the curricula and goals 
of our schools as possible. 

Appendix 1 

The Education Index, published by the H. W. Wilson Company since 1932, is an author 
and subject index to educational material in the English language. Though primarily a 
periodical index, it also covers proceedings, yearbooks, monographs and material 
printed by the U.S. Government. In terms of periodical coverage, the index covers both 
professional and more popular journals. Actual selection of periodicals for indexing is 
accomplished by subscriber vote, at least since 1970. In voting, subscribers are asked to 
place primary emphasis on the reference value of periodicals under consideration. 

Figure 1 was constructed simply by measuring the number of column inches devoted 
to listings concerning testing and curriculum in every volume of the Index from 1932 
(volume 1, covering material published from January 1929 - June 1932) through 1985 
(volume 35; July 1984 - June 1985). Over these volumes there were occasionally some 
changes in the index rubrics concerning testing and curriculum. However, the 
primary rubrics considered as pertaining to the two topics of interest were: 

Testing: 

Tests and Scales 
Testing programs (introduced in vol. 6) 
Tests of general educational development (vol. 23) 
Testing instruments (vol. 25) 

Curriculum: 

Curriculum 
Curriculum Making 
Curriculum laboratories (introduced in vol. 4) 
Curriculum sadsfBiction (vol. 0) 
Curriculum selection (vol. 6) 
Curriculum development (replacing Curriculum Making in vol. 11) 
Curriculum studies (vol. 16) 

It should be noted too that only since 1964 (volume 14 covering 7/63-7/64) has the 
Index been produced annually. Before that time it was issued on either a biennial or 
triennial basis. Thus for years prior to 1964 we have shown the average annual 
numbers of column inches listed under the relevant testing and curriculum rubrics. 
Finally it is worth mentioning that over all 35 volumes of the Index, the type size has 
remained the same and that 1 column inch, including headings and subheadings, 
amounts on average to roughly 2 to 3 references. 
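
The averaging and conversion described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the volume measurement in the example (150 column inches in a triennial volume) is invented to show the arithmetic; it is not a value measured from the Index.

```python
def average_annual_inches(total_inches: float, years_covered: float) -> float:
    """Spread a volume's column inches evenly over the years it covers."""
    if years_covered <= 0:
        raise ValueError("years_covered must be positive")
    return total_inches / years_covered


def approximate_references(inches: float) -> tuple:
    """Convert column inches to a rough (low, high) range of references,
    using the estimate of 2 to 3 references per column inch."""
    return (2 * inches, 3 * inches)


# Hypothetical triennial volume with 150 column inches under testing rubrics.
per_year = average_annual_inches(150, 3)
low, high = approximate_references(per_year)
print(per_year, low, high)
```

Dividing by the number of years covered puts the biennial and triennial volumes on the same annual footing as the later yearly volumes.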
ERIC REFERENCES CONCERNING ASSESSMENT 



ERIC Citations, Total, Regarding Ed'l Ass't, Nat'l Ass't, By Year 

Year   Total Ref's   Ed'l Ass't   Nat'l Ass't   Ratio # 

 65        2959            1            0         0.00 
 66        4856            5            1         0.20 
 67        7115            6            0         0.00 
 68        9624            9            2         0.22 
 69       25317           15           16         1.07 
 70       28954           26           30         1.15 
 71       32499           36           37         1.03 
 72       34063           68           35         0.51 
 73       35208          164           50         0.30 
 74       35274          453           50         0.11 
 75       38168          545           59         0.11 
 76       37417          489           76         0.16 
 77       36914          528           82         0.16 
 78       38471          564           75         0.13 
 79       35984          598           64         0.11 
 80       34591          548           66         0.12 
 81       31516          436           80         0.18 
 82       30661          293           83         0.28 
 83       30506          290           64         0.22 
 84       29591          306           45         0.15 
 85       24275          275           44         0.16 
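
The Ratio column appears to be the number of "national assessment" citations divided by the number of "educational assessment" citations for each year. That reading is inferred from the tabled values and is not stated in the table itself. A minimal sketch in Python that reproduces a few rows under this assumption (the function name is hypothetical):

```python
# (year, "educational assessment" count, "national assessment" count),
# copied from the table above.
rows = [
    (69, 15, 16),
    (70, 26, 30),
    (72, 68, 35),
    (79, 598, 64),
    (85, 275, 44),
]


def citation_ratio(ed_count: int, nat_count: int) -> float:
    """National-to-educational citation ratio, rounded to two decimals.

    Assumes the Ratio column is nat_count / ed_count.
    """
    return round(nat_count / ed_count, 2)


for year, ed_count, nat_count in rows:
    print(year, citation_ratio(ed_count, nat_count))
```

A ratio above 1 marks the years (1969 to 1971) in which "national assessment" citations outnumbered "educational assessment" citations.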



